Abstract: Let $\{(X_i, Y_i)\}_{i \in \{1,\ldots,n\}}$ be an i.i.d. sample from the random
design regression model $Y = f(X) + \varepsilon$ with $(X, Y) \in [0,1] \times [-M, M]$. In
dealing with such a model, adaptation is naturally to be intended in terms
of the $L^2([0,1], G_X)$ norm, where $G_X(\cdot)$ denotes the (known) marginal
distribution of the design variable $X$. Recently much work has been devoted to
the construction of estimators that adapt in this setting (see, for example,
[5, 24, 25, 32]), but only a few of them come with an easy-to-implement
computational scheme. Here we propose a family of estimators based on the
warped wavelet basis recently introduced by Picard and Kerkyacharian [36]
and a tree-like thresholding rule that takes into account the hierarchical
(across-scale) structure of the wavelet coefficients. We show that, if the
regression function belongs to a certain class of approximation spaces defined
in terms of $G_X(\cdot)$, then our procedure is adaptive and converges to the true
regression function at an optimal rate. The results are stated in terms of
excess probabilities as in [19].
1. Introduction

Wavelet bases have been ubiquitous in modern nonparametric statistics since
the seminal 1994 paper by Donoho and Johnstone [27]. What makes them so
appealing to statisticians is their ability to capture the relevant features of
smooth signals in a few "big" coefficients at high scales (low frequencies), so
that zero-thresholding the small ones results in an effective denoising scheme
(see [47]).

Although these well-known results about thresholding techniques were usually
obtained assuming a fixed (and possibly equispaced) design [27, 28], it was quite
reassuring to see how they carry over almost unchanged to the irregular design
case. As a matter of fact, in the case of irregular design, various attempts to
solve this problem have been made: see, for instance, the interpolation methods
of Hall and Turlach [u] and Kovac and Silverman [--SS]; the binning method
of Antoniadis et al. [■]]; the transformation method of Cai and Brown [14],
or its recent refinements by Maxim [40] for a random design; the weighted
wavelet transform of Foster [29]; the isometric method of Sardy et al. [44]; the
penalization method of Antoniadis and Fan [2]; and the specific construction of
wavelets adapted to the design of Delouille et al. [21, 22] and Jansen et al. [46].
See also Pensky and Vidakovic [41], and the monograph [32].

The main drawback common to most of the methods just mentioned can be
found, with no surprise, on the computational side: compared, for instance, with
the usual thresholding technique, the calculations are, in general, less direct. To
fix this problem, Kerkyacharian and Picard [36] propose warped wavelet
bases. The idea is as follows. For a signal observed at some design points, $Y(t_i)$,
$i \in \{1, \ldots, 2^J\}$, if the design is regular ($t_k = k/2^J$), the standard wavelet
decomposition algorithm starts with $s_{J,k} = 2^{J/2}\, Y(k/2^J)$, which approximates the
scaling coefficient $\int Y(x)\,\phi_{J,k}(x)\,dx$, with $\phi_{J,k}(x) = 2^{J/2}\phi(2^J x - k)$ and $\phi(\cdot)$ the
so-called scaling function or father wavelet (see [39] for further information).
Then the cascade algorithm is employed to obtain the wavelet coefficients $d_{j,k}$
for $j \le J$, which in turn are thresholded. If the design is not regular, and we still
employ the same algorithm, then for a function $H(\cdot)$ such that $H(k/2^J) = t_k$,
we have $s_{J,k} = 2^{J/2}\, Y(H(k/2^J))$. Essentially what we are doing is to
decompose, with respect to a standard wavelet basis, the function $Y(H(x))$ or, if
$G \circ H(x) = x$, the original function $Y(x)$ itself but with respect to a new warped
basis $\{\psi_{j,k}(G(\cdot))\}_{(j,k)}$. In the regression setting, this means replacing the
standard wavelet expansion of the function $f(\cdot)$ by its expansion on the new basis
$\{\psi_{j,k}(G(\cdot))\}_{(j,k)}$, where $G(\cdot)$ is adapting to the design: it may be the
distribution function of the design, or its estimation, when it is unknown (not our
case). An appealing feature of this method is that it does not need a new
algorithm to be implemented: just standard and widespread tools. Of course the
properties of this basis depend on the warping factor $G(\cdot)$. In [36] the authors
provide the conditions under which this new basis behaves, at least for statistical
purposes, as well as ordinary wavelet bases with respect to $L^p([0,1], dx)$ norms
with $p \in (0, +\infty)$. This condition properly quantifies the departure from the
uniform distribution and happens to be associated with the notion of Muckenhoupt
weights (see [31, 45]).
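
The warping idea can be checked numerically. The following is a minimal sketch, not the construction of [36]: it assumes the Haar wavelet and a hypothetical known design distribution $G(x) = x^2$ on $[0,1]$ (both chosen only for illustration). Since $G(X)$ is uniform when $X$ has distribution function $G$, the warped functions $\psi_{j,k}(G(\cdot))$ should be orthonormal in $L^2([0,1], G_X)$, whereas the unwarped ones should not.

```python
import numpy as np

def haar(j, k, u):
    """Haar wavelet psi_{j,k}(u) = 2^{j/2} psi(2^j u - k) on [0, 1)."""
    t = (2.0 ** j) * np.asarray(u, dtype=float) - k
    up = ((t >= 0.0) & (t < 0.5)).astype(float)
    down = ((t >= 0.5) & (t < 1.0)).astype(float)
    return (2.0 ** (j / 2.0)) * (up - down)

def G(x):
    """Hypothetical known design CDF on [0, 1] (illustrative assumption)."""
    return np.asarray(x, dtype=float) ** 2

def G_inv(u):
    """Quantile function of the hypothetical design distribution."""
    return np.sqrt(np.asarray(u, dtype=float))

def gram(indices, warp, n_grid=2 ** 12):
    """Gram matrix in L^2([0,1], G_X), integrating over design quantiles."""
    u = (np.arange(n_grid) + 0.5) / n_grid   # midpoints of a uniform grid
    x = G_inv(u)                             # design points with CDF G
    arg = G(x) if warp else x
    basis = np.stack([haar(j, k, arg) for (j, k) in indices])
    return basis @ basis.T / n_grid

indices = [(0, 0), (1, 0), (1, 1), (2, 0), (2, 3)]
warped = gram(indices, warp=True)    # close to the identity matrix
plain = gram(indices, warp=False)    # has non-zero off-diagonal entries
```

Here `warped` is numerically the identity, while `plain` is not, which is the reason the standard cascade algorithm can be reused unchanged once the design is passed through $G$.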

Now the problem is that we do not need good estimators in $L^p([0,1], dx)$.
What we need are (easy to compute) estimators that adapt in $L^2([0,1], G_X)$. As
a matter of fact, it is possible to prove that the main results contained in [36]
can be extended to this new setting once we assume $G_X(\cdot)$ to be known as in
[15], the case of an unknown $G_X(\cdot)$ being beyond the scope of this work (see
[35]).

Here we propose a particular variation on the basic thresholding procedure
advanced in [36], which can be motivated as follows. In a variety of real-life
signals, significant wavelet coefficients often occur in clusters at adjacent scales
and locations. Irregularities, like a discontinuity for example, in general tend to
affect the whole block of coefficients corresponding to wavelet functions whose
"support" contains them. For this reason it is reasonable to expect that the
risk of "blocked" thresholding rules might compare quite favorably with other
classical estimators based on level-wise or global thresholds. The literature is
filled with successful examples of "horizontally" (within scales) blocked rules
derived both from purely frequentist arguments [11, 13, 15, 33] and from Bayesian
reasoning of some flavor [1, 16, 48]. Recently, an increasing amount of work has
been devoted to studying a new class of "vertically" (across scales, see Figure 1)
blocked or treed rules [4, 10, 17, 30, 43], which have proved to be of invaluable
help in at least two settings of great importance: the construction of adaptive
pointwise confidence intervals [42] and the derivation of pointwise estimators
of a regression function that adapt rate optimally under what we could call a
focused performance measure [12].

Fig 1. Examples of thresholding rules: [A] Original wavelet coefficients; [B] Linear
thresholding; [C] Nonlinear (hard) thresholding; [D] Vertical (hard) thresholding.

For this reason, adapting some techniques developed in [5] to the current
(simplified) setting, in Section 2 we show how vertically zero-thresholding the
warped wavelet coefficients actually results in a universal smoother with good
properties in $L^2([0,1], G_X)$ over reasonably large approximation spaces.

2. Tree Structured Warped Approximations 

We shall now discuss in greater detail nonlinear approximation processes based
on warped wavelet bases where a tree structure is pre-imposed on the preserved
coefficients. We will start, following closely [5], by reviewing some basic facts
about partitions and how they are related to adaptive approximation. Then
we present the universal algorithm based on adaptive partitions coming from a
warped wavelet decomposition and its theoretical properties.
In the spirit of the recent paper by Cucker and Smale [20], we will measure
the performance of our estimator by studying its convergence both in
probability and in expectation. More specifically, let $P\{\cdot\}$ be a (generally unknown
or partially unknown) Borel measure defined on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$, and
consider again a nonparametric regression problem where we want to estimate
the conditional mean $f(x) = \mathbb{E}(Y \mid X = x)$ from an i.i.d. sample of size $n$,
$z = z_n = \{(x_i, y_i)\}_{i \in \{1,\ldots,n\}}$, drawn from $P\{\cdot\}$. Assume further that, having chosen a
hypothesis space $\mathcal{H}$ from which our candidate estimators $f_z(\cdot)$ come, we
measure the approximation error of $f_z(\cdot)$ in the $L^2(\mathcal{X}, G_X)$ norm, where
$G_X(\cdot)$ is the (marginal) distribution of the design variable $X$. Here, as in the
previous section, we will assume $G_X(\cdot)$ to be known. So, given $f_z \in \mathcal{H}$, the
quality of its performance is measured by

$$\|f - f_z\| = \|f - f_z\|_{L^2(\mathcal{X}, G_X)}.$$

Clearly this quantity is stochastic in nature and, consequently, it is generally not
possible to say anything about it for a fixed $z$. Instead we look at the behavior
in probability as measured by

$$P^{\otimes n}\{z : \|f - f_z\| > \eta\}, \qquad \eta > 0,$$

or the expected error

$$\mathbb{E}^{\otimes n}(\|f - f_z\|) = \int \|f - f_z\| \, dP^{\otimes n},$$

where $P^{\otimes n}\{\cdot\}$ denotes the $n$-fold tensor product of $P\{\cdot\}$. Clearly, given a bound
for $P^{\otimes n}\{z : \|f - f_z\| > \eta\}$, we can immediately obtain another bound for the
expected error since

$$\mathbb{E}^{\otimes n}(\|f - f_z\|) = \int_0^{\infty} P^{\otimes n}\{z : \|f - f_z\| > \eta\} \, d\eta. \qquad (1)$$

As we will see in Section 4, bounding probabilities like $P^{\otimes n}\{\cdot\}$ usually requires
some kind of concentration-of-measure inequality (see [8]).
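
Equation (1) is the usual tail-integral identity for a non-negative random variable. The sketch below checks it on simulated losses; the exponential distribution and the helper name `expectation_from_tail` are illustrative assumptions and have nothing to do with the actual law of $\|f - f_z\|$.

```python
import numpy as np

def expectation_from_tail(samples, n_eta=20001):
    """Approximate E[Z], Z >= 0, by integrating the empirical tail P{Z > eta}."""
    z = np.sort(np.asarray(samples, dtype=float))
    eta = np.linspace(0.0, z[-1], n_eta)
    # Empirical tail probability: fraction of samples strictly above eta.
    tail = 1.0 - np.searchsorted(z, eta, side="right") / z.size
    # Trapezoidal rule on the eta grid.
    return float(np.sum(0.5 * (tail[1:] + tail[:-1]) * np.diff(eta)))

rng = np.random.default_rng(0)
loss = rng.exponential(scale=0.3, size=50_000)   # stand-in for ||f - f_z||
direct = float(loss.mean())
via_tail = expectation_from_tail(loss)
```

The two numbers agree up to quadrature error, which is why a bound on the excess probability translates directly into a bound on the expected error.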

Now, suppose that we have chosen a reasonable hypothesis space $\mathcal{H}$. We still
need to address the problem of how to find an estimator $f_z(\cdot)$ for the regression
function $f(\cdot)$. One of the most widespread criteria (see [20, 24, 32], and references
therein) is the so-called empirical risk minimization (least-squares data fitting).

Empirical risk minimization is motivated by the fact that the regression
function $f(\cdot)$ is the minimizer of

$$\mathcal{E}(w) = \int [w(x) - y]^2 \, dP.$$

That is,

$$\mathcal{E}(f) = \inf_{w} \mathcal{E}(w).$$

This suggests considering the problem of minimizing the empirical loss

$$\mathcal{E}_z(w) = \frac{1}{n} \sum_{i=1}^{n} [w(x_i) - y_i]^2$$

over all $w \in \mathcal{H}$. So, in the end, we have found an implementable form for our candidate
estimator

$$f_z = f_{z,\mathcal{H}} = \operatorname*{argmin}_{w \in \mathcal{H}} \mathcal{E}_z(w),$$

the so-called empirical minimizer. Notice that, given a finite ball in a linear or
nonlinear finite-dimensional space, the problem of finding $f_z(\cdot)$ is numerically
solvable.
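
As a small illustration of empirical risk minimization, the sketch below takes for the hypothesis space the functions that are constant on each of `n_cells` equal-width cells of $[0,1]$ (an assumption made only for this example). In that case the empirical minimizer is simply the vector of cell averages; the simulated data and all function names are hypothetical.

```python
import numpy as np

def cell_index(x, n_cells):
    """Index of the equal-width cell of [0, 1] containing each point of x."""
    return np.minimum((np.asarray(x) * n_cells).astype(int), n_cells - 1)

def erm_piecewise_constant(x, y, n_cells):
    """Empirical risk minimizer over cell-wise constant functions: cell averages."""
    idx = cell_index(x, n_cells)
    sums = np.bincount(idx, weights=y, minlength=n_cells)
    counts = np.bincount(idx, minlength=n_cells)
    # Empty cells get the value 0 by convention.
    return np.divide(sums, counts, out=np.zeros(n_cells), where=counts > 0)

def empirical_risk(values, x, y, n_cells):
    """(1/n) sum_i (w(x_i) - y_i)^2 for the function w with the given cell values."""
    return float(np.mean((values[cell_index(x, n_cells)] - y) ** 2))

rng = np.random.default_rng(1)
x = rng.random(500)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(500)
fitted = erm_piecewise_constant(x, y, n_cells=8)
risk_fitted = empirical_risk(fitted, x, y, 8)
risk_perturbed = empirical_risk(fitted + 0.05, x, y, 8)   # any other candidate does worse
```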

In the following we will see how to build the hypothesis space $\mathcal{H}$ from refinable
partitions of the design space $\mathcal{X}$ and then how this is related to (warped) wavelet
bases. Typically $\mathcal{H} = \mathcal{H}_n$ depends on a finite number $J(n)$ of parameters such as, for
example, the dimension of a linear space or, equivalently, the number of basis
functions we use to generate it. In many cases, this number $J$ is chosen using
some a priori assumption on the regression function. In other procedures, the
number $J$ avoids any a priori assumptions by adapting to the data. We shall be
interested in estimators of the latter type.

2.1. Partitions, Adaptive Approximation and Least-Squares Fitting 

We will now review some basic facts about partitions and how they are related
to adaptive approximation. The treatment follows closely [5]. A partition $\Lambda$ of
$\mathcal{X} \subset [0,1]^d$ is usually built through a refinement strategy. We first describe the
prototypical example of dyadic partitions and then, in the following section, we
will make the link with orthonormal expansions through a wavelet basis. So let
$\mathcal{X} = [0,1]^d$, and denote by $\mathcal{D}_j = \mathcal{D}_j(\mathcal{X})$ the collection of dyadic subcubes of $\mathcal{X}$
of sidelength $2^{-j}$ and $\mathcal{D} = \bigcup_{j=0}^{\infty} \mathcal{D}_j$. These cubes are naturally aligned on a tree
$\mathcal{T} = \mathcal{T}(\mathcal{D})$. Each node of the tree $\mathcal{T}$ is a cube $I \in \mathcal{D}$. If $I \in \mathcal{D}_j$, then its children
are the $2^d$ dyadic cubes $J \in \mathcal{D}_{j+1}$ with $J \subset I$. We denote the set of children of
$I$ by $\mathcal{C}(I)$. We call $I$ the parent of each such child $J$ and write $I = \mathcal{P}(J)$. The cubes
in $\mathcal{D}_j(\mathcal{X})$ form a uniform partition in which every cube has the same measure
$2^{-jd}$.

More generally, we say that a collection of nodes $T$ is a proper subtree of $\mathcal{T}$
if:

• the root node $I = \mathcal{X}$ is in $T$,

• if $I \neq \mathcal{X}$ is in $T$ then its parent $\mathcal{P}(I)$ is also in $T$.

Any finite proper subtree $T$ is associated to a unique partition $\Lambda = \Lambda(T)$ which
consists of its outer leaves, by which we mean those $J \in \mathcal{T}$ such that $J \notin T$
but $\mathcal{P}(J)$ is in $T$. One way of generating adaptive partitions is through some
refinement strategy. One begins at the root $\mathcal{X}$ and decides whether to refine $\mathcal{X}$
(i.e. subdivide $\mathcal{X}$) based on some refinement criterion. If $\mathcal{X}$ is subdivided, then
one examines each child and decides whether or not to refine such a child based
on the refinement strategy.

We could also consider more general refinements. Assume, for instance, that
$a \ge 2$ is a fixed integer. We assume that if $\mathcal{X}$ is to be refined, then its children
consist of $a$ subsets of $\mathcal{X}$ which are a partition of $\mathcal{X}$. Similarly, for each such
child there is a rule which spells out how this child is refined. We assume that
the child is also refined into $a$ sets which form a partition of the child. Such a
refinement strategy also results in a tree $\mathcal{T}$ (called the master tree) and children,
parents, proper trees and partitions are defined as above for the special case of
dyadic partitions. The refinement level $j$ of a node is the smallest number of
refinements (starting at the root) needed to create this node. Note that to describe these
more general refinements in terms of basis functions, we need to introduce the
concept of warped multi-wavelets and wavelet packets, but this is beyond the
scope of the present work.

We denote by $T_j$ the proper subtree consisting of all nodes with level $< j$ and
we denote by $\Lambda_j$ the partition associated to $T_j$, which coincides with $\mathcal{D}_j(\mathcal{X})$ in
the above described dyadic partition case. Note that, in contrast to this case, the
$a$ children may not be similar, in which case the partitions $\Lambda_j$ are not spatially
uniform (we could also work in even more generality and allow the number
of children to depend on the cell to be refined, while remaining globally bounded
by some fixed $a$). It is important to note that the cardinalities of a proper tree
$T$ and of its associated partition $\Lambda(T)$ are equivalent. In fact one easily checks
that

$$\mathrm{card}(\Lambda(T)) = (a - 1)\,\mathrm{card}(T) + 1,$$

by remarking that each time a new node gets refined in the process of building
an adaptive partition, $\mathrm{card}(T)$ is incremented by 1 and $\mathrm{card}(\Lambda)$ by $a - 1$.
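
The cardinality relation can be verified by simulating an arbitrary refinement process. The sketch below encodes nodes as tuples of child indices, which is only an illustrative convention, and refines randomly chosen outer leaves.

```python
import random

def grow_random_tree(a, n_refinements, seed=0):
    """Refine randomly chosen outer leaves of an a-ary master tree.

    Returns the proper subtree T (refined nodes) and its outer leaves,
    i.e. the cells of the associated partition.
    """
    rng = random.Random(seed)
    tree = {()}                          # T starts with the root only
    leaves = [(c,) for c in range(a)]    # outer leaves: the children of the root
    for _ in range(n_refinements):
        node = leaves.pop(rng.randrange(len(leaves)))
        tree.add(node)                   # the refined cell joins T ...
        leaves.extend(node + (c,) for c in range(a))   # ... and is replaced by its a children
    return tree, leaves

tree, leaves = grow_random_tree(a=3, n_refinements=25)
cardinality_ok = len(leaves) == (3 - 1) * len(tree) + 1
```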

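Given a partition $\Lambda$, we can easily use it to approximate functions supported
on $\mathcal{X}$. More specifically, let us denote by $\mathcal{S}_\Lambda$ the space of piecewise constant
functions, normalized in $L^2(\mathcal{X}, G_X)$, subordinate to $\Lambda$. Each $f \in \mathcal{S}_\Lambda$ can then
be written as

$$f(\cdot) = \sum_{I \in \Lambda} s_I \, \frac{\mathbb{1}_I(\cdot)}{\sqrt{G_X(I)}},$$

where $\mathbb{1}_I(\cdot)$ denotes the indicator function of any set $I \subset \mathcal{X}$. The best
approximation of a given function $f \in L^2(\mathcal{X}, G_X)$ by the elements of $\mathcal{S}_\Lambda$ is given by

$$\Pi_\Lambda(f)(\cdot) = \sum_{I \in \Lambda} s_I \, \frac{\mathbb{1}_I(\cdot)}{\sqrt{G_X(I)}},$$

where

$$s_I = \frac{1}{\sqrt{G_X(I)}} \int_I f \, dG_X, \qquad (2)$$

and $s_I = 0$ in case $G_X(I) = 0$.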

In practice, we can consider two types of approximations corresponding to
uniform refinement and adaptive refinement. We first discuss uniform
refinement. Let

$$E_j(f) = \|f - \Pi_{\Lambda_j}(f)\|_{L^2(G_X)}, \qquad j \in \mathbb{N}_0,$$

which is the error for uniform refinement. The decay of this error to zero is
connected with the smoothness of $f(\cdot)$ as measured in $L^2(\mathcal{X}, G_X)$. We shall
denote by $\mathcal{A}^s$ the approximation space (see the review in [23]) consisting of all
functions $f \in L^2(\mathcal{X}, G_X)$ such that

$$E_j(f) \le M_0\, a^{-js}, \qquad j \in \mathbb{N}_0. \qquad (3)$$

Notice that $\mathrm{card}(\Lambda_j) = a^j$, so that the decay in Equation (3) is like $N^{-s}$ with
$N$ the number of elements in the partition. The smallest $M_0$ for which Equation
(3) holds serves to define the semi-norm $|f|_{\mathcal{A}^s}$ on $\mathcal{A}^s$. The space $\mathcal{A}^s$ can be
viewed as a smoothness space of order $s > 0$ with smoothness measured with
respect to $G_X(\cdot)$. For example, if $G_X(\cdot)$ is the Lebesgue measure and we use
dyadic partitioning, then $\mathcal{A}^{s/d} = B^s_\infty(L^2)$, $s \in (0,1]$, with equivalent norms. Here
$B^s_\infty(L^2)$ is the Besov space, which can be described in terms of differences as

$$\|w(\cdot + h) - w(\cdot)\|_{L^2(dx)} \le M_0\, |h|^s, \qquad x, h \in \mathcal{X}.$$

Instead of working with a priori fixed partitions, there is a second kind of
approximation where the partition is generated adaptively and will vary with
$f(\cdot)$. Adaptive partitions are typically generated by using some refinement
criterion that determines whether or not to subdivide a given cell. We shall
consider a refinement criterion that was introduced to build adaptive wavelet
constructions such as those given by Cohen et al. in [17] for image compression.
This criterion is analogous to thresholding wavelet coefficients. Indeed, it would
be exactly this criterion if we were to construct a wavelet (Haar-like) basis for
$L^2(\mathcal{X}, G_X)$. For each cell $I$ in the master tree $\mathcal{T}$ and any $w \in L^2(\mathcal{X}, G_X)$ we
define

$$\nu_I^2(w) = \sum_{J \in \mathcal{C}(I)} s_J^2(w) - s_I^2(w), \qquad (4)$$

which describes the amount of $L^2(\mathcal{X}, G_X)$ energy which is gained in the
projection of $w(\cdot)$ onto $\mathcal{S}_\Lambda$ when the element $I$ is refined. It also accounts for the
decreased projection error when $I$ is refined. If we were in a classical situation
of Lebesgue measure and dyadic refinement, then $\nu_I^2(w)$ would be exactly the
sum of squares of the (scaling) Haar coefficients of $w(\cdot)$ corresponding to $I$.

We can use $\nu_I(w)$ to generate an adaptive partition. Given any $\lambda > 0$, let
$T(w, \lambda)$ be the smallest proper tree that contains all $I \in \mathcal{T}$ for which $\nu_I(w) \ge \lambda$.
This tree can also be described as the set of all $J \in \mathcal{T}$ such that there exists
$I \subset J$ which verifies $\nu_I(w) \ge \lambda$. Note that since $w \in L^2(\mathcal{X}, G_X)$, the set of nodes
such that $\nu_I(w) \ge \lambda$ is always finite and so is $T(w, \lambda)$. Corresponding to this
tree we have the partition $\Lambda(w, \lambda)$ consisting of the outer leaves of $T(w, \lambda)$. We
shall define some new approximation spaces $\mathcal{B}^s$ which measure the regularity of
a given function $w(\cdot)$ by the size of the tree $T(w, \lambda)$.

Given $s > 0$, we let $\mathcal{B}^s$ be the collection of all $w \in L^2(\mathcal{X}, G_X)$ such that the
following is finite

$$|w|^p_{\mathcal{B}^s} = \sup_{\lambda > 0} \{\lambda^p\, \mathrm{card}(T(w, \lambda))\}, \qquad \text{with } p = (s + \tfrac{1}{2})^{-1}. \qquad (5)$$

We obtain the norm for $\mathcal{B}^s$ by adding $\|w\|_{L^2(G_X)}$ to $|w|_{\mathcal{B}^s}$. One can show that

$$\|w - \Pi_{\Lambda(w,\lambda)}(w)\|_{L^2(G_X)} \le C(s)\, |w|_{\mathcal{B}^s}^{\frac{1}{2s+1}}\, \lambda^{\frac{2s}{2s+1}} \le C(s)\, |w|_{\mathcal{B}^s}\, N^{-s}, \qquad (6)$$

where $N = \mathrm{card}(T(w, \lambda))$ and the constant $C(s)$ depends only on $s$ (see Cohen
et al. [17]). It follows that every function $w \in \mathcal{B}^s$ can be approximated to order
$O(N^{-s})$ by $\Pi_\Lambda(w)(\cdot)$ for some partition $\Lambda$ with $\mathrm{card}(\Lambda) = N$. This should be
contrasted with $\mathcal{A}^s$, which has the same approximation order for the uniform
partition. It is easy to see that $\mathcal{B}^s$ is larger than $\mathcal{A}^s$. In classical settings, the
class $\mathcal{B}^s$ is well understood. For example, in the case of Lebesgue measure and
dyadic partitions we know that each Besov space $B^s_q(L^\tau)$ with $\tau > (s/d + 1/2)^{-1}$
and $q \in (0, \infty]$ arbitrary is contained in $\mathcal{B}^{s/d}$ (see [17]). This should be compared
with $\mathcal{A}^s$, where we know that $\mathcal{A}^{s/d} = B^s_\infty(L^2)$, as we have noted earlier. In the
next section we will see how to "visualize" these approximation spaces when we
use warped wavelet bases to build our partitions.

Until now, we have only considered the problem of approximating elements of
some smoothness class by approximators associated to (adaptive) partitions of
their domain $\mathcal{X}$: no data, no noise; just functions. Here, instead, we assume that
$f(\cdot)$ denotes, as before, the regression function and we return to the problem of
estimating it from a given data set. Clearly, we can use the functions in $\mathcal{H} = \mathcal{S}_\Lambda$
for this purpose, so that the "incarnation" in this context of what we called the
empirical minimizer is given by

$$f_{z,\Lambda} = \operatorname*{argmin}_{w \in \mathcal{S}_\Lambda} \frac{1}{n} \sum_{i=1}^{n} [w(x_i) - y_i]^2,$$

the orthogonal projection of $y = y(x)$ onto $\mathcal{S}_\Lambda$ with respect to the empirical
norm

$$\|y\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} |y(x_i)|^2,$$

with $y(x_i) = y_i$, and we can compute it by solving $\mathrm{card}(\Lambda)$ independent
problems, one for each element $I \in \Lambda$. The resulting estimator can then be written
as

$$f_{z,\Lambda}(\cdot) = \sum_{I \in \Lambda} s_I(z)\, \frac{\mathbb{1}_I(\cdot)}{\sqrt{G_{X,n}(I)}},$$

where, for each $I \in \Lambda$,

$$s_I(z) = \frac{1}{n} \sum_{i=1}^{n} y_i\, \frac{\mathbb{1}_I(x_i)}{\sqrt{G_{X,n}(I)}}, \qquad G_{X,n}(I) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_I(X_i),$$

are the empirical counterparts of the theoretical coefficients defined in Equation
(2). With the coefficients $\{s_I(z)\}_{I \in \Lambda}$ at hand, we can build linear estimators $f_z(\cdot)$
corresponding to uniform partitions with cardinality suitably chosen to balance
the bias and variance of $f_z(\cdot)$ when the true regression function $f(\cdot)$ belongs to
some specific smoothness class. Alternatively, defining the empirical versions of
the residuals introduced in Equation (4) as

$$\nu_I^2(z) = \sum_{J \in \mathcal{C}(I)} s_J^2(z) - s_I^2(z), \qquad (7)$$

we can mimic the adaptive procedure introduced in the previous section (see
Table 1) to get universal^1 estimators based on adaptive partitions. These
partitions have the same tree structure as those used in the CART algorithm [9],
yet the selection of the right partition is quite different since it is not based
on an optimization problem but on a thresholding technique applied to
empirical quantities computed at each node of the tree, which play a role similar
to wavelet coefficients, as we will see in the following (see [26] for a connection
between CART and thresholding in one or several orthonormal bases).

Algorithm: Least-Squares on Adaptive Partitions

Require: Sample $z = \{(x_i, y_i)\}_{i \in \{1,\ldots,n\}}$; threshold $\lambda_n$; smoothness index $\gamma > 0$

Output: An estimator $f_z(\cdot)$ for the regression function $f(\cdot)$

Setup:

1 : Define $J^* = \min\{j \in \mathbb{N} : 2^j \ge \lambda_n^{-1/\gamma}\}$

Generator:

2 : Compute $\nu_I(z)$ for the nodes $I$ at a refinement level $j < J^*$

3 : Threshold $\{\nu_I(z)\}_I$ at level $\lambda_n$, obtaining the set $\Sigma(z,n) = \{I \in T_{J^*} : \nu_I(z) \ge \lambda_n\}$

4 : Complete $\Sigma(z,n)$ to a tree $T(z,n)$ by adding the nodes $J \supset I \in \Sigma(z,n)$

7 : Return the estimator $f_z(\cdot)$ that minimizes the empirical risk on $\Lambda(z,n)$

Table 1

Least-squares on adaptive partitions driven by the empirical residuals $\nu_I(z)$ defined in
Equation (7). Adapted from [5].
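
The procedure of Table 1 can be sketched in a few lines for $\mathcal{X} = [0,1]$ and dyadic refinement. This is a hedged illustration rather than the algorithm of [5]: the residual is taken to be the square root of the empirical energy gained when a cell is split into its two children (our reading of Equation (7)), `max_level` plays the role of $J^*$, and the simulated step function, the threshold value and all function names are assumptions made for the example.

```python
import numpy as np

def coefficient(x, y, j, k):
    """Normalized empirical coefficient s_I(z) of the cell I = [k/2^j, (k+1)/2^j)."""
    inside = (x >= k / 2.0 ** j) & (x < (k + 1) / 2.0 ** j)
    mass = inside.mean()                      # empirical measure of the cell
    return 0.0 if mass == 0 else y[inside].sum() / x.size / np.sqrt(mass)

def residual(x, y, j, k):
    """Assumed residual: root of the energy gained when I is split in two."""
    gain = sum(coefficient(x, y, j + 1, 2 * k + c) ** 2 for c in (0, 1))
    gain -= coefficient(x, y, j, k) ** 2
    return np.sqrt(max(gain, 0.0))

def adaptive_partition(x, y, lam, max_level):
    """Threshold the residuals, complete to a proper tree, return its outer leaves."""
    kept = [(j, k) for j in range(max_level) for k in range(2 ** j)
            if residual(x, y, j, k) >= lam]
    tree = {(0, 0)}                           # the root always belongs to a proper tree
    for j, k in kept:
        while j >= 0:                         # add the node and all of its ancestors
            tree.add((j, k))
            j, k = j - 1, k // 2
    children = {(j + 1, 2 * k + c) for (j, k) in tree for c in (0, 1)}
    return sorted(children - tree)

def fit(x, y, leaves):
    """Least-squares fit on the partition: the average of y over each cell."""
    pred = np.zeros_like(y)
    for j, k in leaves:
        inside = (x >= k / 2.0 ** j) & (x < (k + 1) / 2.0 ** j)
        if inside.any():
            pred[inside] = y[inside].mean()
    return pred

rng = np.random.default_rng(2)
x = rng.random(2000)
y = (x > 0.3).astype(float) + 0.1 * rng.standard_normal(2000)
leaves = adaptive_partition(x, y, lam=0.05, max_level=6)
mse = float(np.mean((fit(x, y, leaves) - y) ** 2))
```

On data of this kind the partition is refined mostly around the jump, while cells where the signal is flat are left coarse; no optimization over partitions is involved, only thresholding.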

2.2. A Universal Algorithm Based on Warped Wavelets 

The choice we made in the previous section of adopting piecewise constant
functions as approximators severely limits the optimal convergence rate to
approximation spaces corresponding to smoothness classes of low or no pointwise
regularity (see [6] for an interesting extension based on piecewise polynomial
approximations). A possible way to fix this problem would be to use the
complexity regularization approach, for which optimal convergence results could be
obtained in the piecewise polynomial context (see for instance Theorem 12.1 in
[32], and the paper by Kohler [37]).

In the present context, where the marginal design distribution $G_X(\cdot)$ is
assumed to be known, we have another option based on the warped systems
introduced in Section 1.

It is worth mentioning that in this section we will concentrate on the case
$\mathcal{X} = [0,1]$. The present setting could be generalized to the case where $G_X(\cdot)$ is a
$d$-dimensional tensor product. However, the full generalization to dimension $d > 1$
is more involved and will not be discussed here.

^1 "Universal" is here synonymous with "adaptive": the estimator does not require
any prior knowledge of the smoothness of the regression function $f(\cdot)$.

To "translate" the concepts highlighted in the previous two sections in terms
of warped systems, consider a compactly supported wavelet basis $\{\psi_{j,k}(\cdot),\ j \ge -1,\ k \in \mathbb{Z}\}$,
where $\psi_{-1,k}(\cdot) = \phi_{0,k}(\cdot)$ denotes the scaling function, and its warped
version

$$\{\psi_{j,k}(G_X(\cdot)) : j \ge -1,\ k \in \mathbb{Z}\}.$$

Then, for each $f \in L^2([0,1], G_X)$, consider its expansion in this basis

$$f(x) = \sum_{j,k} d_{j,k}\, \psi_{j,k}(G_X(x)).$$

In this context, a tree is a finite set $T$ of indexes $(j,k)$, $j \in \mathbb{N}_0$ and
$k \in \{0, \ldots, 2^j - 1\}$, such that $(j,k) \in T$ implies $(j-1, \lfloor k/2 \rfloor) \in T$, i.e., all
"ancestors" of the point $(j,k)$ in the dyadic grid also belong to the tree.

One can then consider the best tree-structured approximation to $f(\cdot)$, by
trying to minimize

$$\Big\| f - \sum_{(j,k) \in T} d_{j,k}\, \psi_{j,k}(G_X(\cdot)) \Big\|_{L^2(G_X)}$$

over all trees $T$ having the same cardinality $N$, and all choices of $d_{j,k}$. However,
the procedure of selecting the optimal tree is costly in computational time, in
comparison to the simple reordering that characterizes the classical thresholding
procedure described in the previous section. A more reasonable approach is to
use suboptimal tree selection algorithms inspired by the adaptive procedure
introduced before. In detail, we start from an initial tree $T_0 = \{(0,0)\}$ and let
it "grow" as follows:

1. Given a tree $T_N$, define its "leaves" $\mathcal{L}(T_N)$ as the indexes $(j,k) \notin T_N$ such
that $(j-1, \lfloor k/2 \rfloor) \in T_N$.

2. For $(j,k) \in \mathcal{L}(T_N)$ define the residual $\nu_{j,k}$



with $I_{j,k} = [2^{-j}k,\ 2^{-j}(k+1)]$.

3. Choose $(j_0,k_0) \in \mathcal{L}(T_N)$ such that

$$\nu_{j_0,k_0} = \max_{(j,k) \in \mathcal{L}(T_N)} \nu_{j,k}.$$

4. Define $T_{N+1} = T_N \cup \{(j_0,k_0)\}$.

Note that this algorithm can either be controlled by the cardinality $N$ of the
tree, or by the size of the residuals as in Table 1.
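
The four growth steps above can be sketched as follows. Because the display defining the residual is not reproduced here, the sketch assumes that the residual of a node is the energy of all coefficients whose dyadic interval is contained in $I_{j,k}$; this choice, the toy coefficients and the function names are hypothetical.

```python
def subtree_energy(coeffs, max_level):
    """Energy of the coefficients at (j, k) and at all of its descendants."""
    energy = {}
    for j in range(max_level, -1, -1):        # bottom-up accumulation
        for k in range(2 ** j):
            e = coeffs.get((j, k), 0.0) ** 2
            if j < max_level:
                e += energy[(j + 1, 2 * k)] + energy[(j + 1, 2 * k + 1)]
            energy[(j, k)] = e
    return energy

def grow_tree(coeffs, max_level, n_nodes):
    """Greedy growth: repeatedly add the leaf with the largest (assumed) residual."""
    energy = subtree_energy(coeffs, max_level)
    tree = {(0, 0)}
    while len(tree) < n_nodes:
        # Leaves: indexes outside the tree whose parent is in the tree.
        leaves = [(j + 1, 2 * k + c) for (j, k) in tree for c in (0, 1)
                  if j + 1 <= max_level and (j + 1, 2 * k + c) not in tree]
        if not leaves:
            break
        tree.add(max(leaves, key=energy.get))
    return tree

def is_tree(nodes):
    """Check that every index has its parent (j - 1, floor(k / 2)) in the set."""
    return all(j == 0 or (j - 1, k // 2) in nodes for (j, k) in nodes)

# Toy coefficients: one large coefficient deep in the dyadic grid.
coeffs = {(0, 0): 1.0, (2, 0): 0.1, (3, 5): 2.0}
tree = grow_tree(coeffs, max_level=3, n_nodes=4)
```

With these toy coefficients the tree grows straight down the branch leading to the large coefficient at $(3,5)$, so the ancestors $(1,1)$ and $(2,2)$ are kept even though their own coefficients are zero. This is exactly the "vertical" behavior that distinguishes treed rules from plain thresholding.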

Now, let $\Lambda$ be the dyadic partition associated to any such tree, and define

$$\Pi_\Lambda(f)(x) = \sum_{I \in \Lambda} d_I\, \psi_I(G(x)), \qquad \text{with } d_I = \langle f, \psi_I(G) \rangle_{L^2(G_X)},$$

and their empirical counterparts

$$f_{z,\Lambda}(x) = \sum_{I \in \Lambda} d_I(z)\, \psi_I(G(x)), \qquad \text{with } d_I(z) = \frac{1}{n} \sum_{i=1}^{n} Y_i\, \psi_I(G(X_i)).$$
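
The empirical coefficients $d_I(z)$ are plain sample averages, which is what makes the method easy to implement. A minimal sketch follows, assuming the Haar wavelet, a hypothetical known design distribution $G(x) = x^2$ and Gaussian noise (which, unlike the model above, is not bounded); all of these are illustrative choices.

```python
import numpy as np

def haar(j, k, u):
    """Haar wavelet psi_{j,k}(u) = 2^{j/2} psi(2^j u - k) on [0, 1)."""
    t = (2.0 ** j) * np.asarray(u, dtype=float) - k
    up = ((t >= 0.0) & (t < 0.5)).astype(float)
    down = ((t >= 0.5) & (t < 1.0)).astype(float)
    return (2.0 ** (j / 2.0)) * (up - down)

def G(x):
    """Hypothetical known design CDF on [0, 1] (illustrative assumption)."""
    return np.asarray(x, dtype=float) ** 2

def empirical_coefficient(x, y, j, k):
    """d_{j,k}(z) = (1/n) sum_i y_i psi_{j,k}(G(x_i))."""
    return float(np.mean(y * haar(j, k, G(x))))

rng = np.random.default_rng(3)
n = 20_000
x = np.sqrt(rng.random(n))                 # X has CDF G(x) = x^2 on [0, 1]
f = 0.5 * haar(1, 0, G(x))                 # regression function with one warped coefficient
y = f + 0.1 * rng.standard_normal(n)
d_10 = empirical_coefficient(x, y, 1, 0)   # estimates the true coefficient 0.5
d_23 = empirical_coefficient(x, y, 2, 3)   # estimates a coefficient equal to 0
```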

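Then, by adapting the techniques used in [5], in Section 4.1 we prove the
following result for uniform partitions:

Theorem 2.1. (Optimality for uniform partitions) Assume that $f \in \mathcal{A}^s$ and
define the estimator $f_z = f_{z,\Lambda_{J^*}}$, with

$$J^* = J^*(n) = \min\{j \in \mathbb{N} : 2^{j(1+2s)} \ge n/\log(n)\}.$$

Then, given any $\beta > 0$, there is a constant $c$ such that

$$P^{\otimes n}\Big\{\|f - f_z\| > (c + |f|_{\mathcal{A}^s}) \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{2s+1}}\Big\} \le C\, n^{-\beta},$$

and

$$\mathbb{E}^{\otimes n}(\|f - f_z\|) \le (C + |f|_{\mathcal{A}^s}) \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{2s+1}},$$

where $C$ depends only on $M$.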

Theorem 2.1 is satisfactory in the sense that the rate $[\log(n)/n]^{s/(2s+1)}$ is
known to be optimal (or minimax) over the class $\mathcal{A}^s$ save for the logarithmic
factor. However, it is unsatisfactory in the sense that the estimation procedure
requires a priori knowledge of the smoothness parameter $s$, which appears in
the choice of the resolution level $J^*$. Moreover, as noted before, the smoothness
assumption $f \in \mathcal{A}^s$ is too severe. Consequently, our next task will consist in
deriving a method capable of treating both defects. To this end, mimicking
Equation (7), we define the empirical residuals as



IS 



Then, for some $\kappa > 0$,^2 let

$$\lambda_n = \kappa \sqrt{\frac{\log(n)}{n}}$$

be a given threshold. Now, adapting the algorithm given in Table 1, assume that
the estimator $f_z(\cdot)$ is generated as detailed in Table 2. Then, in Section 4.2, we
prove the following

^2 $\kappa$ is essentially a smoothing parameter (to be selected by cross-validation, for instance).
Notice that in our theoretical developments we will only assume that $\kappa$ is "large enough" to
ensure the desired concentration inequalities.
Algorithm: "Treed" Approximations from Warped Wavelets

Require: Sample $z = \{(x_i, y_i)\}_{i \in \{1,\ldots,n\}}$; threshold $\lambda_n$; smoothness index $\gamma > 1/2$

Output: An estimator $f_z(\cdot)$ for the regression function $f(\cdot)$

Setup:

1 : Define $J^* = \min\{j \in \mathbb{N} : 2^j \ge \lambda_n^{-1/\gamma}\}$

Generator:

2 : Compute $\nu_{j,k}(z)$ for any node $(j,k)$ at a refinement level $j < J^*$

3 : Threshold $\{\nu_{j,k}(z)\}_{j,k}$ at level $\lambda_n$, obtaining $\Sigma(z,n) = \{(j,k) \in T_{J^*} : \nu_{j,k}(z) \ge \lambda_n\}$

4 : Complete $\Sigma(z,n)$ to a tree $T(z,n)$ by adding the ancestor nodes $(j',k') \in \mathcal{P}(\{(j,k)\})$
for all $(j,k) \in \Sigma(z,n)$

7 : Return the estimator $f_z(\cdot) = \sum_{(j,k) \in \Lambda(z,n)} d_{j,k}(z)\, \psi_{j,k}(G_X(\cdot))$

Table 2

Tree-structured approximations from warped wavelet decompositions.



Theorem 2.2. (Optimality for "growing" adaptive partitions) Let $\beta$ and $\gamma > 1/2$
be arbitrary. Then, there exists $\kappa > 0$ such that, whenever $f \in \mathcal{A}^\gamma \cap \mathcal{B}^s$ for some
$s > 0$, the following inequalities hold





and 



logn] 2s+l 



where the constants c and C do not depend on the sample size n. 

Theorem 2.2 is definitely more satisfactory than Theorem 2.1 in two
respects:

• The optimal rate $[\log(n)/n]^{s/(2s+1)}$ is now obtained under weaker
smoothness assumptions on the regression function, namely, $f \in \mathcal{B}^s$ in place of
$f \in \mathcal{A}^s$, with the extra assumption $f \in \mathcal{A}^\gamma$ with $\gamma > 1/2$ arbitrary.
• The estimator we obtain is adaptive (universal), in the sense that the
value of $s$ does not enter the definition of the algorithm. The procedure
automatically extracts information about the regularity of the regression
function from the data at hand.

It is interesting to notice that in standard thresholding (standard denoising or
density estimation, for instance) one usually sets the highest level $J^*$ so that
$2^{J^*} \sim n/\log(n)$; here we have to stop much sooner, namely, $2^{J^*} \sim \sqrt{n/\log(n)}$,
as in [36]. This is especially necessary to obtain the exponential inequalities in
Sections 4.1 and 4.2.

A final remark on the approximation spaces $\mathcal{A}^s$ and $\mathcal{B}^s$ is in order. In a
previous section, we mentioned that, when $G_X(\cdot)$ is the Lebesgue measure, then
the spaces $\mathcal{A}^s$ and $\mathcal{B}^s$ are well understood. In particular, each Besov space $B^s_q(L^\tau)$
with $\tau > (s + 1/2)^{-1}$ and $q \in (0, +\infty]$ is contained in $\mathcal{B}^s$ (see Cohen et al.
[17, 18]), whereas $\mathcal{A}^s = B^s_\infty(L^2)$. For general partitions it is not totally clear how
to express the content of these approximation spaces in terms of reasonably
simple variations of common smoothness classes. Things get slightly simpler
when we employ a warped wavelet basis to generate the partition. As a result,
we can map approximation properties imposed on $f(\cdot)$ to regularity properties
of its warped version $f \circ G_X^{-1}(\cdot)$. So, assuming $f \in \mathcal{A}^s$ is equivalent to imposing
$f \circ G_X^{-1} \in B^s_\infty(L^2)$ as soon as $\mathcal{A}^s$ is defined in terms of warped wavelets.



3. Discussion 



The dependence on the design marginal $G_X(\cdot)$ is so far a clear weakness of our
approach from both a theoretical and a practical perspective. Nevertheless, an
obvious option to extend our tree-structured procedure to the case of an
unknown $G_X(\cdot)$ would probably end up combining the arguments introduced in
[5] with those considered by Kerkyacharian and Picard in [36] and [35].
Another practical option might be to adopt a split-sample approach and measure
smoothness in terms of the discrete norm induced by the data. Here we also
mention the fact that in Theorem 2.2 we require the knowledge of the
parameter $\gamma$, which can be arbitrarily close to 1/2. As in [5], it is probably possible to
remove the dependency on $\gamma$ at the price of using the much more complicated
construction proposed by Binev and DeVore in [7].



4. Proofs for Section 2 



4.1. Proof of Theorem 2.1 



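For any given partition $\Lambda$, a natural way to control $\|f - f_{z,\Lambda}\|_{L^2(G_X)}$ is by
splitting it into a bias and a variance term, denoted respectively by $e_1$ and $e_2$ in
the following equation

$$\|f - f_{z,\Lambda}\|_{L^2(G_X)} \le \|f - \Pi_\Lambda(f)\|_{L^2(G_X)} + \|\Pi_\Lambda(f) - f_{z,\Lambda}\|_{L^2(G_X)} = e_1 + e_2. \qquad (8)$$

The term $e_1$ will be controlled by using the smoothness assumptions we made in the
statement of the theorem, whereas the variance term $e_2$ will be controlled by
Bernstein's inequality.

Let us start with this second step, observing that, by denoting $[d_I - d_I(z)]$ with
$\Delta_I(z)$, by orthonormality of the warped system we have

$$\|\Pi_\Lambda(f) - f_{z,\Lambda}\|^2_{L^2(G_X)} = \Big\| \sum_{I \in \Lambda} [d_I - d_I(z)]\, \psi_I(G_X(\cdot)) \Big\|^2_{L^2(G_X)} = \sum_{I \in \Lambda} \Delta_I^2(z).$$

Hence, for any $\eta > 0$,

$$P^{\otimes n}\{\|\Pi_\Lambda(f) - f_{z,\Lambda}\|_{L^2(G_X)} > \eta\} = P^{\otimes n}\Big\{\sum_{I \in \Lambda} \Delta_I^2(z) > \eta^2\Big\} \le \mathrm{card}(\Lambda) \cdot P^{\otimes n}\Big\{\Delta_I^2(z) > \frac{\eta^2}{\mathrm{card}(\Lambda)}\Big\} = \mathrm{card}(\Lambda) \cdot P^{\otimes n}\Big\{|\Delta_I(z)| > \frac{\eta}{\sqrt{\mathrm{card}(\Lambda)}}\Big\}. \qquad (9)$$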
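Consequently, to control $e_2$ we just need to control $|\Delta_I(z)|$ and the cardinality
of $\Lambda$. Now, if we define $U = Y \psi_{j,k}(G(X))$, then

• $\|U - \mathbb{E}\{U\}\|_\infty \le 2 \cdot 2^{j/2}\, \|\psi\|_\infty \|f\|_\infty$,

• $\sigma^2 = \mathbb{E}\{|U - \mathbb{E}(U)|^2\} \le \mathbb{E}\{|U|^2\} \le \|f\|_\infty^2$,

as

$$\mathbb{E}\{|\psi_{j,k}(G(X))|^2\} = \int |\psi_{j,k}(G(x))|^2 \, dG_X(x) = \int |\psi_{j,k}(G(G^{-1}(y)))|^2 \, dy = \int |\psi_{j,k}(y)|^2 \, dy = 1.$$

Hence, for any $\eta > 0$, by Bernstein's inequality we get

$$P^{\otimes n}\Big\{|\Delta_I(z)| > \frac{\eta}{\sqrt{\mathrm{card}(\Lambda)}}\Big\} \le 2 \exp\Big\{-\frac{n\, \eta^2 / \mathrm{card}(\Lambda)}{2\big[\sigma^2 + \|U - \mathbb{E}\{U\}\|_\infty\, \eta / (3 \sqrt{\mathrm{card}(\Lambda)})\big]}\Big\} \le 2 \exp\Big\{-\frac{3 n\, \eta^2}{C'\, \mathrm{card}(\Lambda)\,(3 + \eta)}\Big\}, \qquad (10)$$

where $C' = 2 \max\{\|f\|_\infty^2,\ 2\|\psi\|_\infty \|f\|_\infty\}$, and the last inequality comes from the
fact that for any $I \in \Lambda$ we have $2^j = |I|^{-1} \le \mathrm{card}(\Lambda) = 2^J$ for some $J \in \mathbb{N}$, $\Lambda$ being
a dyadic partition.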

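Now, back to our specific case. First of all remember that, by definition,

$$J^* = J^*(n) = \min\{j \in \mathbb{N} : 2^{j(1+2s)} \ge n/\log(n)\},$$

so that, by the minimality of $J^*$,

$$\Big[\frac{n}{\log(n)}\Big]^{\frac{1}{1+2s}} \le \mathrm{card}(\Lambda_{J^*}) = 2^{J^*} < 2 \Big[\frac{n}{\log(n)}\Big]^{\frac{1}{1+2s}}. \qquad (11)$$

Hence, by definition of $\mathcal{A}^s$ we get the following bound for $e_1$:

$$\|f - \Pi_{\Lambda_{J^*}}(f)\|_{L^2(G_X)} \le |f|_{\mathcal{A}^s}\, 2^{-J^* s} \le |f|_{\mathcal{A}^s} \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{1+2s}}.$$

From Equation (8) we then get

$$\|f - f_{z,\Lambda_{J^*}}\|_{L^2(G_X)} \le |f|_{\mathcal{A}^s} \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{1+2s}} + e_2,$$

therefore, for all $\delta > 0$,

$$P^{\otimes n}\{\|f - f_{z,\Lambda_{J^*}}\|_{L^2(G_X)} \ge \delta\} \le P^{\otimes n}\Big\{e_2 > \delta - |f|_{\mathcal{A}^s} \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{1+2s}}\Big\}. \qquad (12)$$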
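Taking $\delta = (c + |f|_{\mathcal{A}^s}) [\log(n)/n]^{s/(2s+1)}$ as in the statement of Theorem 2.1, and
applying Equations (9) and (10), noticing that $[\log(n)/n]^{s/(2s+1)} < 1$ for every $s > 0$,
we obtain

$$P^{\otimes n}\Big\{e_2 > c \Big[\frac{\log(n)}{n}\Big]^{\frac{s}{1+2s}}\Big\} \le 2\, \mathrm{card}(\Lambda_{J^*}) \exp\Big\{-\frac{n}{\mathrm{card}(\Lambda_{J^*})} \Big[\frac{\log(n)}{n}\Big]^{\frac{2s}{1+2s}} \frac{3 c^2}{C'(3 + c)}\Big\}.$$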

But from Equation (11) we know how to bound the cardinality of our partition, 
therefore 



n r log(n) 1 i + 2s \ ^ ^ 

card(A,,*) L n J j C'{3 + c) 



3c- 



s2 



1 2s 
pog(ra) ] l + 2s r log(") 1 l + 2s 

L " J / C'(3 + c) 



3(c) ■ log{n), 



with 
so that 

p^"{I1/-/za,.I1l^(Gx)^'5} 



5(5) 



3 c- 



;;2 



4C"(3 + c)^ 



3 (gin 



^ 2-2^ 



62 > C 



log(") 



l+2s 



l+2s 



exp • 



{log [n"f(^'] } 



where the last inequahty holds as soon as g{c) — 1 > /3. And this complete the 
proof since from here we can easily derive a bound for the risk by using Equation 

(!)• 



4.2. Proof of Theorem 2.2 

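Let's start with a bit of notation. First of all, for each $\lambda > 0$, we will denote by

• $\mathcal{T}(f, \lambda)$: smallest tree which contains all dyadic intervals $I$ such that $\nu_I > \lambda$.

• $\Lambda(f, \lambda)$: partition induced by the outer leaves of $\mathcal{T}(f, \lambda)$.

• $\mathcal{T}(f, \lambda, z)$: smallest tree which contains all dyadic intervals $I$ such that
$\nu_I(z) > \lambda$.

• $\Lambda(f, \lambda, z)$: partition induced by the outer leaves of $\mathcal{T}(f, \lambda, z)$.

If $\Lambda_0$ and $\Lambda_1$ are partitions associated to the trees $\mathcal{T}_0$ and $\mathcal{T}_1$, then we denote by

• $\Lambda_0 \vee \Lambda_1$ the partition associated to the tree $\mathcal{T}_0 \cup \mathcal{T}_1$,

• $\Lambda_0 \wedge \Lambda_1$ the partition associated to the tree $\mathcal{T}_0 \cap \mathcal{T}_1$.

Finally, let $\lambda_n = \kappa \sqrt{\log(n)/n}$ for some $\kappa > 0$, and

\[
j^* = \min\big\{ j \in \mathbb{N} : 2^{-(j+1)} \le \lambda_n^{1/\gamma} \big\}.
\]

Then for each $\lambda > 0$, define the partitions

• $\Lambda(\lambda) = \Lambda(f, \lambda) \wedge \Lambda_{j^*}$,

• $\Lambda(\lambda, z) = \Lambda(f, \lambda, z) \wedge \Lambda_{j^*}$.

Therefore, in this section, we consider the adaptive estimator

\[
f_{z,n}(x) = f_{z,\Lambda(\lambda_n, z)}(x) = \sum_{I \in \Lambda(\lambda_n, z)} d_I(z)\, \psi_I(G_X(x)).
\]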
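The tree notation above can be sketched in a few lines. The code below is an illustration only: it assumes that a dyadic interval $[k 2^{-j}, (k+1) 2^{-j})$ is indexed by the pair `(j, k)`, that every tree contains the root `(0, 0)`, and that the set of intervals passing the threshold is already given; the function names `smallest_tree` and `outer_leaves` are hypothetical.

```python
def smallest_tree(marked):
    """Smallest set of dyadic nodes containing the root, every marked node
    and all of their ancestors (a node (j, k) has parent (j - 1, k // 2))."""
    tree = {(0, 0)}
    for j, k in marked:
        if j < 0 or not 0 <= k < 2 ** j:
            raise ValueError(f"({j}, {k}) is not a dyadic interval of [0, 1)")
        # Walk up until we hit a node already in the tree; since the tree is
        # closed under taking parents, all further ancestors are present too.
        while (j, k) not in tree:
            tree.add((j, k))
            j, k = j - 1, k // 2
    return tree


def outer_leaves(tree):
    """Children of tree nodes that are not in the tree, sorted by left endpoint.
    These intervals form the partition associated with the tree."""
    leaves = []
    for j, k in tree:
        for child in ((j + 1, 2 * k), (j + 1, 2 * k + 1)):
            if child not in tree:
                leaves.append(child)
    return sorted(leaves, key=lambda node: node[1] / 2 ** node[0])


# Two toy trees, each generated by a single interval passing the threshold.
t0 = smallest_tree({(2, 1)})
t1 = smallest_tree({(1, 1)})

join = outer_leaves(t0 | t1)  # partition for the union of the trees
meet = outer_leaves(t0 & t1)  # partition for the intersection of the trees
```

Representing trees as Python sets makes the operations $\vee$ and $\wedge$ on partitions plain set union and intersection of the underlying trees; in both cases the outer leaves have lengths summing to one, so they do partition $[0, 1)$.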

Let's now start the proof by observing that, using the triangle inequality, we can
decompose the loss as follows:

\[
\|f - f_{z,n}\|_{L^2(G_X)} \le e_1 + e_2 + e_3 + e_4,
\]

where

• $e_1 = \|f - \Pi_{\Lambda(\lambda_n, z) \vee \Lambda(2\lambda_n)}(f)\|_{L^2(G_X)}$,

• $e_2 = \|\Pi_{\Lambda(\lambda_n, z) \vee \Lambda(2\lambda_n)}(f) - \Pi_{\Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)}(f)\|_{L^2(G_X)}$,

• $e_3 = \|\Pi_{\Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)}(f) - f_{z,\Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)}\|_{L^2(G_X)}$,

• $e_4 = \|f_{z,\Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)} - f_{z,\Lambda(\lambda_n, z)}\|_{L^2(G_X)}$.

This type of splitting is frequently used in the analysis of wavelet thresholding
procedures to deal with the fact that the partition built from those $I$ such that
$\nu_I(z) \ge \lambda_n$ does not exactly coincide with the partition which would be chosen
by an oracle based on those $I$ such that $\nu_I \ge \lambda_n$. This is accounted for by the
terms $e_2$ and $e_4$, which correspond to those dyadic intervals $I$ such that $\nu_I(z)$ is
significantly larger or smaller than $\nu_I$ respectively, and which will be proved to be
small in probability. The remaining terms $e_1$ and $e_3$ correspond respectively
to the bias and variance of oracle estimators based on partitions obtained by
zero-thresholding based on the unknown quantities $\nu_I$.

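The first term $e_1$, being a bias, is treated by a deterministic estimate as
in [5]. More specifically, since $\Lambda(\lambda_n, z) \vee \Lambda(2\lambda_n)$ is a refinement of $\Lambda(2\lambda_n) =
\Lambda(f, 2\lambda_n) \wedge \Lambda_{j^*}$, we have (almost surely):

\begin{align*}
e_1 &\le \|f - \Pi_{\Lambda(2\lambda_n)}(f)\|_{L^2(G_X)} \\
&\le \|f - \Pi_{\Lambda(f, 2\lambda_n)}(f)\|_{L^2(G_X)} + \|\Pi_{\Lambda(f, 2\lambda_n)}(f) - \Pi_{\Lambda(2\lambda_n)}(f)\|_{L^2(G_X)} \\
&\le \|f - \Pi_{\Lambda(f, 2\lambda_n)}(f)\|_{L^2(G_X)} + \|f - \Pi_{\Lambda_{j^*}}(f)\|_{L^2(G_X)} \\
&\le C(s)\, [2\lambda_n]^{\frac{2s}{2s+1}} |f|_{\mathcal{B}^s} + 2^{-\gamma j^*} |f|_{\mathcal{A}^\gamma} \\
&\le C(s)\, [2\lambda_n]^{\frac{2s}{2s+1}} |f|_{\mathcal{B}^s} + 2^{\gamma} \lambda_n |f|_{\mathcal{A}^\gamma}.
\end{align*}

Therefore

\[
e_1 \le C(s) \Big\{ (2\kappa)^{\frac{2s}{2s+1}} + 2^{\gamma} \kappa \Big\} \max\big\{ |f|_{\mathcal{A}^\gamma}, |f|_{\mathcal{B}^s} \big\} \Big[ \frac{\log(n)}{n} \Big]^{\frac{s}{2s+1}}
= c_1 \Big[ \frac{\log(n)}{n} \Big]^{\frac{s}{2s+1}}
\]

as soon as $f \in \mathcal{B}^s \cap \mathcal{A}^\gamma$, with $c_1 = C(s) \big\{ (2\kappa)^{\frac{2s}{2s+1}} + 2^{\gamma} \kappa \big\} \max\big\{ |f|_{\mathcal{A}^\gamma}, |f|_{\mathcal{B}^s} \big\}$.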

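The third term $e_3$ is treated by the estimate provided by combining Equations
(9) and (10):

\[
P^{\otimes n}\{ e_3 > \eta \} \le 2 \operatorname{card}(\Lambda_3) \exp\Big\{ - \frac{3 n\, \eta^2}{C' \operatorname{card}(\Lambda_3)\,(3 + \eta)} \Big\}, \tag{13}
\]

where $\Lambda_3 = \Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)$. So

\begin{align*}
\operatorname{card}(\Lambda_3) &\le \operatorname{card}\big( \Lambda(2^{-1}\lambda_n) \big) = \operatorname{card}\big( \Lambda(f, 2^{-1}\lambda_n) \wedge \Lambda_{j^*} \big) \\
&\le \operatorname{card}\big( \Lambda(f, 2^{-1}\lambda_n) \big) \le (2^{-1}\lambda_n)^{-p} |f|^p_{\mathcal{B}^s} = 2^p \lambda_n^{-\frac{2}{1+2s}} |f|^p_{\mathcal{B}^s} \\
&= 2^p \kappa^{-\frac{2}{1+2s}} |f|^p_{\mathcal{B}^s} \Big[ \frac{\log(n)}{n} \Big]^{-\frac{1}{1+2s}}
= c_3 \Big[ \frac{\log(n)}{n} \Big]^{-\frac{1}{1+2s}}, \tag{14}
\end{align*}

where we have used the fact that $1/p = 1/2 + s$.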
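The exponent bookkeeping behind this cardinality bound is easy to get wrong, so here is a small numerical check that $(2^{-1}\lambda_n)^{-p}|f|^p$ with $\lambda_n = \kappa\sqrt{\log(n)/n}$ and $1/p = 1/2 + s$ equals a constant times $[\log(n)/n]^{-1/(1+2s)}$. The function names and the sample values of `n`, `kappa`, `s` and `seminorm` are hypothetical, chosen only for this example.

```python
import math


def threshold(n, kappa):
    """lambda_n = kappa * sqrt(log(n) / n)."""
    return kappa * math.sqrt(math.log(n) / n)


def cardinality_bound(n, kappa, s, seminorm):
    """Direct form (lambda_n / 2)^(-p) * |f|^p with 1/p = 1/2 + s."""
    p = 2.0 / (1.0 + 2.0 * s)
    return (0.5 * threshold(n, kappa)) ** (-p) * seminorm ** p


def cardinality_bound_closed_form(n, kappa, s, seminorm):
    """Same quantity written as c3 * [log(n) / n]^(-1 / (1 + 2 s))."""
    p = 2.0 / (1.0 + 2.0 * s)
    c3 = 2.0 ** p * kappa ** (-p) * seminorm ** p
    return c3 * (math.log(n) / n) ** (-1.0 / (1.0 + 2.0 * s))


direct = cardinality_bound(5000, 1.5, 0.75, 2.0)
closed = cardinality_bound_closed_form(5000, 1.5, 0.75, 2.0)
```

The two expressions agree up to floating-point rounding because $p/2 = 1/(1+2s)$, which is exactly the identity used in the last step of the display above.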

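For the remaining two terms, $e_2$ and $e_4$, we will show that, for any fixed $\beta > 0$,
there exists a constant $C'' > 0$ such that:

\[
P^{\otimes n}\{ e_2 > 0 \} + P^{\otimes n}\{ e_4 > 0 \} \le C'' n^{-\beta}. \tag{15}
\]

Before we prove this, let's show why it is sufficient. Let $0 < \delta = c \big[ \frac{\log(n)}{n} \big]^{\frac{s}{1+2s}}$
as in the statement of Theorem 2.2. Then we have

\begin{align*}
P^{\otimes n}\big\{ \|f - f_{z,n}\|_{L^2(G_X)} \ge \delta \big\}
&\le P^{\otimes n}\{ e_1 + e_2 + e_3 + e_4 \ge \delta \} \\
&\le P^{\otimes n}\Big\{ e_2 + e_3 + e_4 \ge (c - c_1) \Big[ \frac{\log(n)}{n} \Big]^{\frac{s}{2s+1}} \Big\} \\
&\le P^{\otimes n}\{ e_2 > 0 \} + P^{\otimes n}\{ e_4 > 0 \} + P^{\otimes n}\{ e_3 \ge \tilde\delta \} \\
&\le C'' n^{-\beta} + P^{\otimes n}\{ e_3 \ge \tilde\delta \} \qquad \text{(by Eq. (15))},
\end{align*}

where $\tilde\delta = (c - c_1) \big[ \frac{\log(n)}{n} \big]^{\frac{s}{2s+1}}$. Repeating the steps used to derive the bound
we needed in Section 4.1, from Equations (13) and (14), we obtain

\[
\frac{n}{\operatorname{card}(\Lambda_3)} \Big[ \frac{\log(n)}{n} \Big]^{\frac{2s}{1+2s}} \frac{3 (c - c_1)^2}{C' [3 + (c - c_1)]}
\ge \frac{n}{c_3} \Big[ \frac{\log(n)}{n} \Big]^{\frac{1}{1+2s}} \Big[ \frac{\log(n)}{n} \Big]^{\frac{2s}{1+2s}} \frac{3 (c - c_1)^2}{C' [3 + (c - c_1)]}
= g(c) \cdot \log(n),
\]

where

\[
g(c) = \frac{3 (c - c_1)^2}{c_3\, C' [3 + (c - c_1)]}.
\]

Therefore

\[
P^{\otimes n}\{ e_3 \ge \tilde\delta \} \le 2 c_3 \Big[ \frac{\log(n)}{n} \Big]^{-\frac{1}{1+2s}} n^{-g(c)} \le c'\, n^{-\beta}
\]

as soon as $g(c) - 1 \ge \beta$. And this would conclude the proof of Theorem 2.2.
We need to prove Equation (15). The main tool is the following lemma.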

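Lemma 4.1. For each $I \in \Lambda_{j^*}$, we have

\begin{align*}
P^{\otimes n}\big( \{ \nu_I(z) \le \lambda_n \} \cap \{ \nu_I \ge 2\lambda_n \} \big) &\le 4 n^{-g(\kappa)}, \\
P^{\otimes n}\big( \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \} \big) &\le 4 n^{-g(\kappa)},
\end{align*}

where

\[
g(\kappa) = \frac{3 \kappa^2}{8 C' \big( 3 + \kappa^{1 - \frac{1}{2\gamma}} \big)}.
\]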



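Before we prove Lemma 4.1, let's show why this is sufficient. Remember that

\[
e_2 = \|\Pi_{\Lambda(\lambda_n, z) \vee \Lambda(2\lambda_n)}(f) - \Pi_{\Lambda(\lambda_n, z) \wedge \Lambda(2^{-1}\lambda_n)}(f)\|_{L^2(G_X)}.
\]

Consequently

• $e_2 = 0$ if $\mathcal{T}(\lambda_n, z) \cup \mathcal{T}(2\lambda_n) = \mathcal{T}(\lambda_n, z) \cap \mathcal{T}(2^{-1}\lambda_n)$,

• $e_2 > 0$ if

\begin{align*}
\mathcal{T}(\lambda_n, z) \cup \mathcal{T}(2\lambda_n) \supsetneq \mathcal{T}(\lambda_n, z) \cap \mathcal{T}(2^{-1}\lambda_n)
&\iff \mathcal{T}(\lambda_n, z) \not\subset \mathcal{T}(2^{-1}\lambda_n) \ \text{ or } \ \mathcal{T}(2\lambda_n) \not\subset \mathcal{T}(\lambda_n, z) \\
&\iff \exists\, I \ \text{s.t.} \ \{ \nu_I(z) \le \lambda_n \} \cap \{ \nu_I \ge 2\lambda_n \} \ \text{ or } \ \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \}.
\end{align*}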

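Therefore

\[
P^{\otimes n}\{ e_2 > 0 \} \le \sum_{I \in \Lambda_{j^*}} P^{\otimes n}\big( \{ \nu_I(z) \le \lambda_n \} \cap \{ \nu_I \ge 2\lambda_n \} \big)
+ \sum_{I \in \Lambda_{j^*}} P^{\otimes n}\big( \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \} \big) = R_1 + R_2. \tag{16}
\]

Then, by applying the first part of Lemma 4.1, we get

\begin{align*}
R_1 &\le \operatorname{card}(\Lambda_{j^*})\, 4 n^{-g(\kappa)} = \operatorname{card}(\Lambda_0)\, 2^{j^*}\, 4 n^{-g(\kappa)} \\
&\le \operatorname{card}(\Lambda_0)\, \lambda_n^{-1/\gamma}\, 4 n^{-g(\kappa)}
= \operatorname{card}(\Lambda_0)\, \kappa^{-\frac{1}{\gamma}} \Big[ \frac{n}{\log(n)} \Big]^{\frac{1}{2\gamma}} 4 n^{-g(\kappa)} \\
&\le C_1''\, n^{-\left( g(\kappa) - \frac{1}{2\gamma} \right)}, \tag{17}
\end{align*}

with $C_1'' = 4 \operatorname{card}(\Lambda_0)\, \kappa^{-1/\gamma}$, and analogously, by the second part of Lemma 4.1, we obtain

\[
R_2 \le C_1''\, n^{-\left( g(\kappa) - \frac{1}{2\gamma} \right)}. \tag{18}
\]

Applying again the second part of Lemma 4.1, we are also able to bound $e_4$
as follows:

\[
P^{\otimes n}\{ e_4 > 0 \} \le \sum_{I \in \Lambda_{j^*}} P^{\otimes n}\big( \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \} \big) \le C_1''\, n^{-\left( g(\kappa) - \frac{1}{2\gamma} \right)}. \tag{19}
\]

Combining Equations (16), (17), (18), and (19), we complete the proof of Equation
(15). In fact, given $\beta$ and $\gamma$, we can find $\kappa$ such that the theorem
holds.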

4.2.1. Proof of Lemma 4.1

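Let's start by noticing that, for each $\eta > 0$,

\[
\{ \nu_I(z) \le \eta \} \cap \{ \nu_I \ge 2\eta \} \subset \{ |\nu_I(z) - \nu_I| \ge \eta \},
\]

hence

\[
P^{\otimes n}\big( \{ \nu_I(z) \le \eta \} \cap \{ \nu_I \ge 2\eta \} \big) \le P^{\otimes n}\{ |\nu_I(z) - \nu_I| \ge \eta \}.
\]

In addition

\begin{align*}
|\nu_I(z) - \nu_I| &= \Big| \sqrt{\textstyle\sum_{J \in \{I^+, I^-\}} d_J^2(z)} - \sqrt{\textstyle\sum_{J \in \{I^+, I^-\}} d_J^2} \Big|
= \big| \|d(z)\|_{\ell^2} - \|d\|_{\ell^2} \big| \\
&\le \|d(z) - d\|_{\ell^2} = \sqrt{ [d_{I^+}(z) - d_{I^+}]^2 + [d_{I^-}(z) - d_{I^-}]^2 },
\end{align*}

where $I^+$ and $I^-$ denote respectively the left and right child of $I$. So

\[
\{ |\nu_I(z) - \nu_I| \ge \eta \} \subset \{ \|d(z) - d\|^2_{\ell^2} \ge \eta^2 \}
\subset \Big\{ |d_{I^+}(z) - d_{I^+}|^2 \ge \frac{\eta^2}{2} \Big\} \cup \Big\{ |d_{I^-}(z) - d_{I^-}|^2 \ge \frac{\eta^2}{2} \Big\}.
\]

Therefore

\[
P^{\otimes n}\big( \{ \nu_I(z) \le \eta \} \cap \{ \nu_I \ge 2\eta \} \big)
\le P^{\otimes n}\Big( |\Delta_{I^+}(z)| \ge \frac{\eta}{\sqrt{2}} \Big) + P^{\otimes n}\Big( |\Delta_{I^-}(z)| \ge \frac{\eta}{\sqrt{2}} \Big).
\]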

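If we now take $\eta = \lambda_n$, by applying Bernstein's inequality as in Section
4.1, for $J \in \{I^+, I^-\}$ we obtain¹

\begin{align*}
P^{\otimes n}\Big( |\Delta_J(z)| \ge \frac{\lambda_n}{\sqrt{2}} \Big)
&= P^{\otimes n}\Big( |\Delta_J(z)| \ge \frac{\kappa}{\sqrt{2}} \sqrt{\frac{\log(n)}{n}} \Big) \\
&\le 2 \exp\Bigg\{ - \frac{3 \kappa^2 \log(n)}{2 C' \Big[ 3 + 2^{(j+1)/2}\, 2^{-1/2}\, \kappa \sqrt{\frac{\log(n)}{n}} \Big]} \Bigg\} \\
&\le 2 \exp\Bigg\{ - \frac{3 \kappa^2 \log(n)}{8 C' \Big[ 3 + 2^{j/2} \kappa \sqrt{\frac{\log(n)}{n}} \Big]} \Bigg\}.
\end{align*}

¹ Compare with the proof of Proposition 3 in [36].

Now, by hypothesis, we know that

\[
2^j \le \lambda_n^{-1/\gamma} = \Big[ \frac{1}{\kappa} \sqrt{\frac{n}{\log(n)}} \Big]^{\frac{1}{\gamma}} = \kappa^{-\frac{1}{\gamma}} \Big[ \frac{n}{\log(n)} \Big]^{\frac{1}{2\gamma}},
\]

therefore

\[
2^{j/2} \kappa \sqrt{\frac{\log(n)}{n}} \le \kappa^{1 - \frac{1}{2\gamma}} \Big[ \frac{\log(n)}{n} \Big]^{\frac{1}{2} \left( 1 - \frac{1}{2\gamma} \right)} \le \kappa^{1 - \frac{1}{2\gamma}},
\]

hence

\[
P^{\otimes n}\Big( |\Delta_J(z)| \ge \frac{\kappa}{\sqrt{2}} \sqrt{\frac{\log(n)}{n}} \Big)
\le 2 \exp\Bigg\{ - \frac{3 \kappa^2 \log(n)}{8 C' \big( 3 + \kappa^{1 - \frac{1}{2\gamma}} \big)} \Bigg\}
= 2 \exp\{ -g(\kappa) \log(n) \} = 2 n^{-g(\kappa)},
\]

with

\[
g(\kappa) = \frac{3 \kappa^2}{8 C' \big( 3 + \kappa^{1 - \frac{1}{2\gamma}} \big)}.
\]

So finally

\[
P^{\otimes n}\big( \{ \nu_I(z) \le \lambda_n \} \cap \{ \nu_I \ge 2\lambda_n \} \big) \le 4 n^{-g(\kappa)}.
\]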
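Now, let's evaluate the other term in a similar manner, starting from

\[
P^{\otimes n}\big( \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \} \big)
\le P^{\otimes n}\{ |\nu_I(z) - \nu_I| \ge 2^{-1}\lambda_n \}
\le \sum_{J \in \{I^+, I^-\}} P^{\otimes n}\Big( |\Delta_J(z)| \ge \frac{\lambda_n}{2\sqrt{2}} \Big).
\]

By the same arguments adopted before, we see that, for each $J \in \{I^+, I^-\}$,

\[
P^{\otimes n}\Big( |\Delta_J(z)| \ge \frac{\lambda_n}{2\sqrt{2}} \Big) \le 2 n^{-g(\kappa)},
\]

and consequently

\[
P^{\otimes n}\big( \{ \nu_I(z) \ge \lambda_n \} \cap \{ \nu_I \le 2^{-1}\lambda_n \} \big) \le 4 n^{-g(\kappa)}.
\]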

References 

[1] F. Abramovich, P. Besbeas, and T. Sapatinas. Empirical Bayes approach 
to block wavelet function estimation. Computational Statistics and Data 
Analysis, 39:435-451, 2002. 

[2] A. Antoniadis and J. Fan. Regularization of wavelet approximations. Jour- 
nal of the American Statistical Association, 96:939-967, 2001. 



[3] A. Antoniadis, G. Gregoire, and P. Vial. Random design wavelet curve
smoothing. Statistics and Probability Letters, 35:225-232, 1997.

[4] F. Autin, D. Picard, and V. Rivoirard. Maxiset approach for Bayesian
nonparametric estimation. Mathematical Methods of Statistics, 15(4):349-373,
2006.

[5] P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. N. Temlyakov. Universal
algorithms for learning theory part I: piecewise constant functions.
Journal of Machine Learning Research, 6:1297-1321, 2005.

[6] P. Binev, A. Cohen, W. Dahmen, and R. DeVore. Universal algorithms
for learning theory part II: piecewise polynomial functions. Constructive
Approximation, 26(2):127-152, August 2007.

[7] P. Binev and R. DeVore. Fast computation in adaptive tree approximation.
Numerische Math., 97(2):193-217, 2004.

[8] S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In
O. Bousquet, U. v. Luxburg, and G. Ratsch, editors, Advanced Lectures in
Machine Learning, pages 208-240. Springer, 2004.

[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification
and regression trees. Wadsworth International, 1984.

[10] P. Brutti. Variable bandwidth schemes for local polynomial smoothers via
vertical wavelet thresholding. In S. Barber, R.G. Aykroyd, and K.V. Mardia,
editors, Bioinformatics, Images, and Wavelets, pages 119-121. Department
of Statistics, University of Leeds, 2004.

[11] T. Cai. Adaptive wavelet estimation: a block thresholding and oracle
inequality approach. The Annals of Statistics, 27:898-924, 1999.

[12] T. Cai and M. G. Low. Nonparametric estimation over shrinking
neighborhoods: superefficiency and adaptation. The Annals of Statistics, 33(1),
2005.

[13] T. Cai and W. Silverman. Incorporating information on the neighboring
coefficients into wavelet estimation. Sankhya, Series B, 63:127-148, 2001.
Special issue on wavelets.

[14] T. T. Cai and L. D. Brown. Wavelet shrinkage for nonequispaced samples.
The Annals of Statistics, 26(5):1783-1799, October 1998.

[15] C. Chesneau. Wavelet block thresholding for samples with random design:
a minimax approach under the L^p risk. Electronic Journal of Statistics,
1:331-346, 2007.

[16] D. De Canditiis and B. Vidakovic. Wavelet Bayesian block shrinkage via
mixtures of normal-inverse gamma priors. Technical Report RT 234/01,
Istituto per le Applicazioni del Calcolo, Sezione di Napoli, 2001.

[17] A. Cohen, W. Dahmen, I. Daubechies, and R. DeVore. Tree approximation
and optimal encoding. Applied Computational and Harmonic Analysis,
11(2):167-191, 1999.

[18] A. Cohen, I. Daubechies, O. G. Guleryuz, and M. T. Orchard. On the
importance of combining wavelet-based nonlinear approximation with coding
strategies. IEEE Transactions on Information Theory, 48(7):1895-1921,
2002.

[19] F. Cucker and S. Smale. On the mathematical foundations of learning
theory. Bulletin Amer. Math. Soc., 39:1-49, 2002.

[20] Felipe Cucker and Steve Smale. On the mathematical foundations of
learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1-49 (electronic), 2002.

[21] V. Delouille, J. Franke, and R. von Sachs. Nonparametric stochastic
regression with design-adapted wavelets. Sankhya, Series A, 63(3):328-366,
2001.

[22] V. Delouille, J. Simoens, and R. von Sachs. Smooth design-adapted
wavelets for nonparametric stochastic regression. Journal of the American
Statistical Association, 99(467):643-658, 2004.

[23] R. DeVore. Nonlinear approximation. Acta Numerica, pages 1-99, 1998.

[24] R. DeVore, G. Kerkyacharian, D. Picard, and V. N. Temlyakov.
Mathematical methods for supervised learning. Research Report 04:22, Industrial
Mathematics Institute, 2004.

[25] R. DeVore, G. Kerkyacharian, D. Picard, and V. N. Temlyakov. On
mathematical methods of learning. Research Report 04:10, Industrial
Mathematics Institute, 2004.

[26] D. L. Donoho. CART and best-ortho-basis: A connection. The Annals of
Statistics, 25(5):1870-1911, October 1997.

[27] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet
shrinkage. Biometrika, 81(3):425-455, 1994.

[28] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet
shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B,
57(2):301-370, 1995. With discussion.

[29] G. Foster. Wavelet for period analysis of unequally sampled time series.
Astronomy Journal, 112:1709-1729, 1996.

[30] P. Fryzlewicz. Bivariate hard thresholding in wavelet function estimation.
Technical Report TR-04-03, Department of Mathematics, Imperial College
London, UK, 2004.

[31] J. Garcia-Cuerva and J. M. Martell. Wavelet characterization of weighted
spaces. Journal of Geometrical Analysis, 11:241-264, 2001.

[32] L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory
of Nonparametric Regression. Springer, 2002.

[33] P. Hall, G. Kerkyacharian, and D. Picard. On the minimax optimality of
block thresholded wavelet estimators. Statistica Sinica, 9:33-50, 1999.

[34] P. Hall and B. A. Turlach. Interpolation methods for nonlinear wavelet
regression with irregularly spaced design. The Annals of Statistics,
25:1912-1925, 1997.

[35] G. Kerkyacharian and D. Picard. Thresholding in learning theory.
Constructive Approximation, 26(2):173-203, August 2007.

[36] Gerard Kerkyacharian and Dominique Picard. Regression in random design
and warped wavelets. Bernoulli, 10(6):1053-1105, 2004.

[37] M. Kohler. Nonlinear orthogonal series estimates for random design
regression. Journal of Statistical Planning and Inference, 115:491-520, 2003.

[38] A. Kovac and B. W. Silverman. Extending the scope of wavelet regression
methods by coefficient-dependent thresholding. Journal of the American
Statistical Association, 95:172-183, 2000.


[39] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, second 
edition, 1998. 

[40] V. Maxim. Denoising signals observed on a random design. In Fifth
AFA-SMAI Conference on Curves and Surfaces, 2002.

[41] M. Pensky and B. Vidakovic. On non-equally spaced wavelet regression.
Ann. Inst. Statist. Math. Soc, 53:681-690, 2001.

[42] D. Picard and K. Tribouley. Adaptive confidence interval for pointwise
curve estimation. The Annals of Statistics, 28(1):298-335, 2000.

[43] J. Romberg, H. Choi, and R. Baraniuk. Bayesian tree-structured image
modeling using wavelet domain hidden Markov models. IEEE Transactions
on Image Processing, 10(7):1056-1068, 2001.

[44] S. Sardy, D. B. Percival, A. G. Bruce, H-Y Gao, and W. Stuetzle. Wavelet
de-noising for unequally spaced data. Statistics and Computing, 9:65-75,
1999.

[45] M. L. Stein. Spline smoothing with an estimated order parameter. The
Annals of Statistics, 21(3):1522-1544, 1993.

[46] E. Vanraes, M. Jansen, and A. Bultheel. Stabilized wavelet transforms
for non-equispaced data smoothing. Signal Processing, 82(12):1979-1990,
2002.

[47] B. Vidakovic. Statistical Modeling by Wavelets. Wiley-Interscience, New
York, 1999.

[48] X. W. Wang and A. T. A. Wood. Empirical Bayes block shrinkage of
wavelet coefficients via the non-central chi-squared distribution. Technical
Report 03-01, University of Nottingham, Division of Statistics, 2003.






